Abstract. The Earley algorithm is a widely used parsing method in
natural language processing applications. We introduce a variant of Earley
parsing that is based on a "delayed" recognition of constituents. This
allows us to start the recognition of a constituent only in cases in which 
all of its subconstituents have been found within the input string. This 
is particularly advantageous in several cases in which partial analysis of 
a constituent cannot be completed and in general in all cases of pro- 
ductions sharing some suffix of their right-hand sides (even for different 
left-hand side nonterminals). Although the two algorithms result in the 
same asymptotic time and space complexity, from a practical perspective 
our algorithm improves the time and space requirements of the original 
method, as shown by reported experimental results. 
1 Introduction

Earley parsing is one of the most commonly used methods for the (automatic)
syntactic analysis of natural language sentences, given a context-free grammar
model. This method does not use backtracking, resulting in time and space 
efficiency, and is quite flexible, in that it does not require the input grammar 
to be cast in any particular form. Earley parsing was first defined in ^3|, in the 
context of formal language parsing. This method has later been rediscovered 
in JlO| , pd| from the perspective of application to natural language processing, 
where it was called active chart parsing. Active chart parsing also makes use of a
data structure, called agenda, which allows a more flexible control of competing 
analyses. 



Research by the first author is carried out within the framework of the Priority Pro- 
gramme Language and Speech Technology (TST). The TST-Programme is sponsored 
by NWO (Dutch Organization for Scientific Research). 



A considerable number of results and applications regarding Earley parsing 
have been published in the literature. From a theoretical perspective, improve- 
ments of the Earley algorithm have been reported in (9), [jl5| and fig] . Several 
reformulations of Earley parsing have also been presented. Most remarkably, 
in H Earley parsing is related to the deterministic simulation of a particular 
kind of nondeterministic pushdown automaton, and a recursive reformulation of 
Earley parsing has been proposed in |]l4j| . 

From the perspective of natural language parsing, the Earley method has 
been adapted to work with context-free grammars enriched with feature struc- 
tures in [^2) , (2^] and , and to cope with on-line semantic interpretation in |^7j . 
Comparison of Earley parsing with other parsing strategies has been experimen- 
tally carried out and reported in Q and Q . 

In this paper we focus on a drawback of the Earley algorithm: the recognition 
of a production within the input is started by looking for the constituents in 
its right-hand side, proceeding from left to right. In this process, the algorithm 
keeps track of the position within the input at which the recognition has started. 
Since this information is needed only if the whole recognition can be carried to 
an end, the algorithm behaves in a rather inefficient way in several cases in 
which production recognition cannot be successfully completed. We propose a 
variant of the original method, in which the problem is solved by delaying some 
of the computation until the involved productions have been fully recognized. 
This is achieved using an idea first presented in Q in the context of left-corner 
parsing, as will be discussed at length in the final section. When applied in the
framework of active chart parsing, our technique results in the "inversion" of the
fundamental rule [10], which combines a left active edge with a right inactive
edge. Although our proposal does not result in an asymptotic improvement of 
the time and space complexity of the Earley algorithm, reported experimental 
results provide evidence that in practical cases our method achieves an increase 
in time and space efficiency. 

The remainder of this paper is organized as follows. In Section 2 some
preliminaries are discussed. We review the Earley parsing method in Section 3 and
then introduce our variant in Section 4. Some empirical results are given in
Section 5, and related work is discussed in Section 6.

2 Preliminaries 

We introduce the formal notation that will be used throughout the paper. 

A string $w$ is a finite sequence of symbols over some alphabet. We denote as
$|w|$ the length of $w$, and as $\varepsilon$ the (unique) string of length zero. The set of all
strings over some alphabet $\Sigma$, $\varepsilon$ included, is denoted $\Sigma^*$. A context-free grammar
(CFG) is a rewriting system $G = (V_T, V_N, P, S)$, where $V_T$ and $V_N$ are two finite,
disjoint sets of terminal and nonterminal symbols, respectively, $S \in V_N$ is the
start symbol, and $P$ is a finite set of productions. Each production has the form
$A \to \alpha$ with $A \in V_N$ and $\alpha \in (V_N \cup V_T)^*$. The size of $G$, written $|G|$, is defined
as $\sum_{(A \to \alpha) \in P} |A\alpha|$.
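The size measure can be made concrete with a small sketch. The representation below (a production as a pair of a left-hand side and a tuple of right-hand-side symbols) and the names `Production`, `grammar_size` and `TOY` are assumptions made for illustration only; they do not come from the paper.

```python
from typing import NamedTuple, Tuple


class Production(NamedTuple):
    """A production A -> alpha; rhs is () for the empty string."""
    lhs: str
    rhs: Tuple[str, ...]


def grammar_size(productions):
    """|G|: sum over all productions A -> alpha of |A alpha| = 1 + |alpha|."""
    return sum(1 + len(p.rhs) for p in productions)


# Hypothetical toy grammar: S -> A B, A -> a, B -> (empty string)
TOY = [
    Production("S", ("A", "B")),
    Production("A", ("a",)),
    Production("B", ()),
]

print(grammar_size(TOY))  # (1 + 2) + (1 + 1) + (1 + 0) = 6
```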



We generally use symbols $A, B, C, \ldots$ to range over $V_N$, symbols $a, b, c, \ldots$ to
range over $V_T$, symbols $X, Y$ to range over $V_N \cup V_T$, symbols $\alpha, \beta, \gamma, \ldots$ to range
over $(V_N \cup V_T)^*$, and symbols $v, w, x, \ldots$ to range over $V_T^*$. For a fixed grammar,
the binary relation $\Rightarrow$ is defined over $(V_N \cup V_T)^*$ such that $\gamma A \delta \Rightarrow \gamma \alpha \delta$ whenever
$A \to \alpha$ belongs to $P$. We will mainly use the reflexive and transitive closure of
$\Rightarrow$, denoted $\Rightarrow^*$.

3 Earley Parsing 

We briefly present here the Earley algorithm, before introducing the variant of 
this method in the next section. 

Let $G = (V_T, V_N, P, S)$ be a CFG. We associate with $G$ a set of symbols,
called dotted items, specified as:

$I_E = \{[A \to \alpha \bullet \beta] \mid (A \to \alpha\beta) \in P\}$. (1)

Dotted items are used below to represent intermediate steps in the process of
recognition of a production of the grammar, where the sequence of symbols in
between the arrow and the dot indicates the sequence of constituents recognized
so far at consecutive positions within the input string. More precisely,
given a production $p : (A \to X_1 X_2 \cdots X_r)$, $r \geq 0$, the process of recognition
of the right-hand side of $p$ is carried out in several steps. We start from item
$A \to \bullet X_1 X_2 \cdots X_r$, attesting that the empty sequence of constituents has been
collected so far. This item represents a prediction for $p$. We then proceed with
item $A \to X_1 \bullet X_2 \cdots X_r$ after the recognition of a constituent $X_1$, and so on.
Production $p$ has been fully recognized only if we reach item $A \to X_1 X_2 \cdots X_r \bullet$,
attesting therefore the complete recognition of a constituent $A$. In active chart
parsing, items in $I_E$ with the dot not at the rightmost position of the right-hand
side are used to label the so-called active edges.

Given a string $w = a_1 a_2 \cdots a_n$, with $n \geq 0$ and each $a_i$ a terminal symbol,
we call position within $w$ any integer $i$ such that $0 \leq i \leq n$. In what follows, $E$
is a square matrix whose entries are subsets of $I_E$ and are addressed by indices
that are positions within the input string. Entries are denoted as $E_{i,j}$. The
insertion by the algorithm of item $[A \to \alpha \bullet \beta]$ in $E_{i,j}$, $i \leq j$, attests the fact that
the sequence of constituents in $\alpha$ exactly spans the substring $a_{i+1} \cdots a_j$ of the
input. (See below for a more precise characterization of the algorithm.) Control
flow is not specified in the method below, since it is usually regulated by means
of a data structure called agenda, which directs the incremental construction
of the table by means of an iteration: starting from an empty table, items are
added as long as needed, and with the desired priority.

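Algorithm 1 (Earley) Let $G = (V_T, V_N, P, S)$ be a CFG. Let $w = a_1 a_2 \cdots a_n$
be an input string, $n \geq 0$, and $a_i \in V_T$ for $1 \leq i \leq n$. Compute the least
$(n + 1) \times (n + 1)$ table $E$ such that $[S \to \bullet \alpha] \in E_{0,0}$ for each $(S \to \alpha) \in P$, and

1. $[A \to \bullet \gamma] \in E_{j,j}$ if $[B \to \alpha \bullet A \beta] \in E_{i,j}$, $(A \to \gamma) \in P$;

2. $[A \to \alpha a_j \bullet \beta] \in E_{i,j}$ if $[A \to \alpha \bullet a_j \beta] \in E_{i,j-1}$;

3. $[A \to \alpha B \bullet \beta] \in E_{i,j}$ if $[A \to \alpha \bullet B \beta] \in E_{i,k}$, $[B \to \gamma \bullet] \in E_{k,j}$.

The string $w$ is accepted if and only if $[S \to \alpha \bullet] \in E_{0,n}$ for some $(S \to \alpha) \in P$.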
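The three steps above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the agenda is a plain stack, productions are assumed to be `(lhs, rhs)` pairs with `rhs` a tuple of single-character symbols, the input is a string, and the names `earley`, `accepts` and `EXAMPLE` are hypothetical. Position `j` in the text corresponds to index `j` here, so that $a_{j+1}$ is `w[j]`.

```python
def earley(productions, start, w):
    """Build the least table E of Algorithm 1.

    An item [A -> alpha . beta] is the triple (A, rhs, dot); E[(i, j)] is a set.
    """
    n = len(w)
    E = {(i, j): set() for i in range(n + 1) for j in range(i, n + 1)}
    agenda = [(0, 0, (A, rhs, 0)) for (A, rhs) in productions if A == start]
    while agenda:
        i, j, item = agenda.pop()
        if item in E[(i, j)]:
            continue
        E[(i, j)].add(item)
        A, rhs, dot = item
        if dot < len(rhs):
            X = rhs[dot]
            # Step 1: predict every production for the symbol after the dot.
            for (B, gamma) in productions:
                if B == X:
                    agenda.append((j, j, (B, gamma, 0)))
            # Step 2: scan the next input symbol.
            if j < n and w[j] == X:
                agenda.append((i, j + 1, (A, rhs, dot + 1)))
            # Step 3, seen from the active item: use completed X-items in E[j, k].
            for k in range(j, n + 1):
                for (B, gamma, d) in E[(j, k)]:
                    if B == X and d == len(gamma):
                        agenda.append((i, k, (A, rhs, dot + 1)))
        else:
            # Step 3, seen from the completed item: advance active items in E[h, i].
            for h in range(i + 1):
                for (B, beta, d) in E[(h, i)]:
                    if d < len(beta) and beta[d] == A:
                        agenda.append((h, j, (B, beta, d + 1)))
    return E


def accepts(productions, start, w):
    """True iff some [S -> alpha .] is in E[0, n]."""
    E = earley(productions, start, w)
    return any(A == start and dot == len(rhs) for (A, rhs, dot) in E[(0, len(w))])


# Hypothetical test grammar: S -> A B, A -> C, B -> C, C -> a C, C -> (empty)
EXAMPLE = [("S", ("A", "B")), ("A", ("C",)), ("B", ("C",)),
           ("C", ("a", "C")), ("C", ())]
print(accepts(EXAMPLE, "S", "aaa"))
```

Step 3 is applied from both sides (when an active item arrives and when a completed item arrives), so the result does not depend on the order in which the agenda is emptied.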

The correctness of the algorithm immediately follows from the property below, 
whose proof can be found in [^| and || . 

Proposition 1. In Algorithm 1, an item $[A \to \alpha \bullet \beta]$ is inserted in $E_{i,j}$ if and
only if the following conditions hold:

A1. $S \Rightarrow^* a_1 \cdots a_i A \gamma$, some $\gamma$; and
A2. $\alpha \Rightarrow^* a_{i+1} \cdots a_j$.

For methods cruder than the Earley algorithm, membership of an item in some
entry may merely be subject to condition A2, which is sufficient for determining
the correctness of the input. However, Earley's algorithm is more selective, as is
apparent from condition A1, which characterizes the so-called top-down filtering
capability of the method. Condition A1 guarantees that only those constituents
are predicted that are compatible with the portion of the input that has been
read so far.

Assuming the working grammar as fixed, a simple analysis reveals that
Algorithm 1 runs in time $O(n^3)$.³ This will be more carefully discussed in the next
section.

4 A Variant of Earley Parsing 

In this section we introduce a variant of Earley parsing that can be obtained by 
reconsidering the way in which the results of the intermediate steps are stored 
in the process of production recognition. 

Let us focus on the dependence of the running time of Algorithm 1 on the
length of the input string. From this perspective, the most expensive step is
Step 3. Intuitively, this is the case because there might be $O(n^2)$ items that
are inserted at this step in some entry of $E$, and each item can in turn be the
result of $O(n)$ different combinations of pairs of items already in $E$. In practice,
the total number of different combinations of dotted items attempted by Step 3
when processing an input string dominates the running time of Algorithm 1. The
change to the new method consists in a decomposition of Step 3 that results, in
some cases, in a reduction of this number. We introduce the basic idea through
an example.

Consider a production $p : (A \to A_1 A_2 \cdots A_r)$, $r \geq 3$. Let $D$ be a set
containing $d \geq 2$ positions within the input string. Assume that the dotted item
$[A \to A_1 \bullet A_2 \cdots A_r]$ has been inserted in the entry $E_{i,j_1}$, for each $i \in D$ and
for some fixed $j_1$. This corresponds to $d$ constituents $A_1$ recognized within the
input. Assume also that, for each $t$ with $2 \leq t \leq r - 1$, a constituent $A_t$ has
been recognized in entry $E_{j_{t-1},j_t}$. Finally, assume that no constituent $A_r$ is found
starting at position $j_{r-1}$ (see Figure 1). Under these assumptions, Step 3 will be
executed $d(r - 2)$ more times, carrying out $d$ independent recognition processes
for $p$, to find out at the end that none of these processes can be successfully
completed, because of the lack of constituent $A_r$. These recognition processes
are independent of one another because in Step 3 we record the position within
the input where each process started (the positions in $D$).

Fig. 1. We depict the case of $d = 3$, $r = 4$, and assume $D = \{i_1, i_2, i_3\}$. We represent
the input string by means of a horizontal line and each dotted item in $E$ by means
of an arc; only the relevant positions within the input string are depicted. In the
attempt to recognize production $A \to A_1 \cdots A_4$, the algorithm has created 3 dotted items
$[A \to A_1 \bullet A_2 A_3 A_4]$, one for each position in $D$, depicted by solid arcs above the
horizontal line. Since each of these items has a different left position, the Earley algorithm
is forced to instantiate 3 independent processes for the recognition of $A \to A_1 \cdots A_4$.
These processes will create the dotted items depicted by the dashed arcs. Note that
in collecting the remaining constituents $A_2, A_3, A_4$ the method duplicates the needed
effort.

³ When both the input string and the grammar are taken as input parameters,
Algorithm 1 runs in time $O(|G|^2 n^3)$. An improvement of Algorithm 1 has been presented
that runs in time $O(|G| n^3)$.

We observe that the left position of p in the input string is needed only if 
the recognition process of p can be successfully completed, in order to locate 
the constituent corresponding to the left-hand side of p for use in the remaining 
analysis of the input. On this basis, we reformulate Step 3 by splitting it up 
into two substeps. The first substep performs the recognition of p in a forward 
manner, without maintaining any record of the left position. This is done using an 
array U in whose entries we store only the suffixes of p's right-hand side that must 
still be recognized. If the recognition can be successfully completed, we apply 
the second substep and compute the left positions of p in a backward manner, 
starting from the rightmost constituent in p's right-hand side and proceeding 
toward the left, storing the intermediate results in table T. 

The proposed technique thus delays part of the computation from the former
Step 3 until we are guaranteed that $p$ can be successfully recognized. In this way we
avoid the computational inefficiency revealed by our example. In fact, whenever
$p$'s recognition cannot be completed, no backward computation is performed
by our method, resulting in some time and space savings. More precisely, the
same computation performed by the $d(r - 2)$ executions of Step 3 in Algorithm 1
will be performed by $r - 2$ executions of the forward substep, and 0 executions
of the backward substep. In addition we observe that, even in the presence of a
constituent $A_r$ with left position $j_{r-1}$ in the input string, the proposed technique
performs more efficiently than the original formulation of Step 3. In fact, since
the backward substep proceeds from right to left, constituents $A_r, A_{r-1}, \ldots$ will
be visited only once in the attempt to find all possible left positions for $p$.
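As a worked instance of these counts, take the case depicted in Figure 1, with $d = 3$ and $r = 4$, and assume as in the example that no constituent $A_4$ is found:

```latex
\[
\underbrace{d\,(r-2)}_{\text{Step 3 of Algorithm 1}} = 3 \cdot 2 = 6
\qquad\text{versus}\qquad
\underbrace{r-2}_{\text{forward substep}} = 2 .
\]
```

The ratio between the two counts is $d$, the number of distinct left positions at which the production was started.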

We observe that for the technique described above to work in its full
generality, Step 2 from Algorithm 1 should also be split into two substeps. This allows
correct treatment of productions containing terminal symbols in their right-hand
sides. Finally, it is not difficult to see that the problem described above can be
generalized to productions sharing some suffix of their right-hand sides, that
is, productions of the form $A \to \alpha\gamma$ and $B \to \beta\gamma$, in cases where $\gamma$ is, at some
position, predicted independently for both productions.

We are now in a position to give a precise specification of the proposed
parsing algorithm. Let $G = (V_T, V_N, P, S)$ be a CFG. We associate with $G$ a set
of symbols, called suffix items, specified as:

$I_V = \{[\beta] \mid (A \to \alpha\beta) \in P\}$. (2)

Suffix items serve two different purposes. First, the insertion of suffix item $[\beta]$ in
entry $U_j$, where $U$ is a one-dimensional array, means that the process of forward
recognition of a production $A \to \alpha\beta$, for some $A$ and $\alpha$, has been successfully
carried out, up to position $j$ and up to the constituents in the sequence $\alpha$. In other
words, there exists at least one $i$, $i \leq j$, such that some dotted item $[A \to \alpha \bullet \beta]$
would have been inserted in $E_{i,j}$ by Algorithm 1. Second, the insertion of suffix
item $[\beta]$ in $T_{j,k}$ means that at least one production $A \to \alpha\beta$, for some $A$ and
$\alpha$, has been completely recognized and the constituents in the sequence $\beta$ have
been collected backwards so far, spanning the substring $a_{j+1} \cdots a_k$.

Algorithm 2 (Variant of Earley) Let $G = (V_T, V_N, P, S)$ be a CFG. Let
$w = a_1 a_2 \cdots a_n$ be an input string, $n \geq 0$, and $a_i \in V_T$ for $1 \leq i \leq n$. Compute the
least $(n + 1) \times (n + 1)$ table $T$ and the least $(n + 1)$ array $U$ such that $[\alpha] \in U_0$
for each $(S \to \alpha) \in P$, and

1. $[\gamma] \in U_j$ if $[A\beta] \in U_j$, $(A \to \gamma) \in P$;

2. $[\beta] \in U_j$ if $[a_j \beta] \in U_{j-1}$;

3. $[\beta] \in U_j$ if $[B\beta] \in U_k$, $(B \to \gamma) \in P$, $[\gamma] \in T_{k,j}$;

4. $[\varepsilon] \in T_{m,m}$ if $[\varepsilon] \in U_m$;

5. $[a_j \beta] \in T_{j-1,m}$ if $[a_j \beta] \in U_{j-1}$, $[\beta] \in T_{j,m}$;

6. $[B\beta] \in T_{k,m}$ if $[B\beta] \in U_k$, $(B \to \gamma) \in P$, $[\gamma] \in T_{k,j}$, $[\beta] \in T_{j,m}$.

The string $w$ is accepted if and only if $[\alpha] \in T_{0,n}$ for some $(S \to \alpha) \in P$.
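The six steps can be sketched in Python as a naive fixed-point iteration. This is an illustrative sketch under stated assumptions, not the authors' implementation: control flow is a simple "repeat until nothing changes" loop rather than an agenda, a suffix item $[\beta]$ is represented by the tuple `beta`, productions are `(lhs, rhs)` pairs over single-character symbols, and the names `variant`, `accepts_variant` and `EXAMPLE` are hypothetical. Position `j` in the text corresponds to index `j` here, so $a_{j+1}$ is `w[j]`.

```python
def variant(productions, start, w):
    """Compute the least array U and table T of Algorithm 2."""
    n = len(w)
    rhs_of = {}
    for A, rhs in productions:
        rhs_of.setdefault(A, []).append(tuple(rhs))
    U = [set() for _ in range(n + 1)]
    T = {(j, k): set() for j in range(n + 1) for k in range(j, n + 1)}
    U[0].update(rhs_of.get(start, []))
    changed = True

    def add(cell, item):
        nonlocal changed
        if item not in cell:
            cell.add(item)
            changed = True

    while changed:
        changed = False
        for j in range(n + 1):
            for item in list(U[j]):
                if not item:
                    add(T[(j, j)], item)                      # Step 4
                    continue
                X, beta = item[0], item[1:]
                for gamma in rhs_of.get(X, []):
                    add(U[j], gamma)                          # Step 1
                if j < n and w[j] == X:
                    add(U[j + 1], beta)                       # Step 2
                    for m in range(j + 1, n + 1):
                        if beta in T[(j + 1, m)]:
                            add(T[(j, m)], item)              # Step 5
                for gamma in rhs_of.get(X, []):
                    for k in range(j, n + 1):
                        if gamma in T[(j, k)]:
                            add(U[k], beta)                   # Step 3
                            for m in range(k, n + 1):
                                if beta in T[(k, m)]:
                                    add(T[(j, m)], item)      # Step 6
    return U, T


def accepts_variant(productions, start, w):
    """True iff [alpha] is in T[0, n] for some production S -> alpha."""
    _, T = variant(productions, start, w)
    return any(A == start and tuple(rhs) in T[(0, len(w))]
               for (A, rhs) in productions)


# Hypothetical test grammar: S -> A B, A -> C, B -> C, C -> a C, C -> (empty)
EXAMPLE = [("S", ("A", "B")), ("A", ("C",)), ("B", ("C",)),
           ("C", ("a", "C")), ("C", ())]
print(accepts_variant(EXAMPLE, "S", "aaa"))
```

The repeated full passes make the sketch easy to check against the six rules, at the price of efficiency; an agenda-driven version would avoid re-examining unchanged items.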

Step 1 of Algorithm 1 exactly corresponds to Step 1 of Algorithm 2. Step 2 of
Algorithm 1 has now been split into Steps 2 and 5 of Algorithm 2, which act
as forward and backward substeps, respectively. Similarly, Step 3 of Algorithm 1
has been split into Steps 3 and 6 of Algorithm 2. Step 4 of Algorithm 2 is needed
to initiate the backward process of recognizing a production, after the forward
process has completed recognition of the right-hand side.

The correctness of the method directly follows from the property stated be- 
low, which characterizes the presence of suffix items in entries of U and T. 

Proposition 2. In Algorithm 2, an item $[\beta]$ is inserted in $U_j$ if and only if the
following conditions hold:

A1. $S \Rightarrow^* a_1 \cdots a_i A \gamma$, some $i$, $A$ and $\gamma$;
A2. $(A \to \alpha\beta) \in P$, some $\alpha$; and
A3. $\alpha \Rightarrow^* a_{i+1} \cdots a_j$,

and an item $[\beta]$ is inserted in $T_{j,k}$ if and only if the following conditions hold:

B1. the conditions A1, A2 and A3 hold; and
B2. $\beta \Rightarrow^* a_{j+1} \cdots a_k$.

The proof of the above statement is similar to that of Proposition 1.

It is not difficult to see that Algorithm 2 has running time $O(n^3)$ (again, we
assume the working grammar is fixed). Therefore Algorithms 1 and 2 present
the same asymptotic time complexity. For the purpose of more carefully
comparing the two algorithms, we give below an alternative to Proposition 2 which
characterizes the entries in $U$ and $T$ in terms of the entries in $E$.

Proposition 3. In Algorithm 2, an item $[\beta]$ is inserted in $U_j$ if and only if the
following condition holds:

A1. at least one item $[A \to \alpha \bullet \beta]$ is inserted in $E_{i,j}$ by Algorithm 1, for some $A$,
$\alpha$ and $i$,

and an item $[\beta]$ is inserted in $T_{j,k}$ if and only if the following conditions hold:

B1. the condition A1 holds; and
B2. $\beta \Rightarrow^* a_{j+1} \cdots a_k$.

This proposition clearly shows that the number of items in $U$ is never larger
than the number of items in $E$: several items $[A \to \alpha \bullet \beta]$ in $E_{i,j}$ for fixed $j$ but
differing $A$, $\alpha$ and $i$ correspond to one single item $[\beta]$ in $U_j$.

On the other hand, the number of items in $T$ may be larger than the number
of items in $E$, since for each $[A \to \alpha \bullet \beta]$ in $E_{i,j}$ we may have $[\beta]$ in several $T_{j,k}$
for distinct values of $k$. Since there may be up to $n$ such $k$ in the worst case, the
number of items in $T$ may be up to $n$ times larger than the number of items in
$E$.

One example of a CFG where this phenomenon is apparent is the following.

$S \to AB$    $A \to C$    $B \to C$    $C \to aC$    $C \to \varepsilon$

For input $a^n$, some $n$, Algorithm 1 computes $n + 1$ items of the form
$[S \to A \bullet B] \in E_{0,i}$, $0 \leq i \leq n$, and $n + 1$ items of the form $[S \to AB \bullet] \in E_{0,j}$,
$0 \leq j \leq n$. On the other hand, Algorithm 2 computes $\frac{(n+1)(n+2)}{2}$ items of the form
$[B] \in T_{i,j}$, $0 \leq i \leq j \leq n$.
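The quadratic count for Algorithm 2 can be checked by counting position pairs. The derivation below assumes that $[B]$ is present in $T_{i,j}$ for every pair $0 \leq i \leq j \leq n$, which Proposition 3 suggests for this grammar since both $A$ and $B$ derive every string $a^k$:

```latex
\[
\bigl|\{(i,j) \mid 0 \le i \le j \le n\}\bigr|
  \;=\; \sum_{i=0}^{n} (n - i + 1)
  \;=\; \frac{(n+1)(n+2)}{2} .
\]
```

This grows quadratically in $n$, whereas the two families of Earley items listed above together contain only $2(n+1)$ items.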

We define $|E| = \sum_{i,j} |E_{i,j}|$, $|U| = \sum_{i} |U_i|$, $|T| = \sum_{i,j} |T_{i,j}|$, and summarize the
above as follows.

Proposition 4. For a fixed CFG and input of length $n$, let $E$ be constructed by
Algorithm 1 and $U$ and $T$ by Algorithm 2. Then:

1. $|U| \leq |E|$; and

2. $|T| \leq n \cdot |E|$.

The second part of this proposition seems to suggest that the table size may be
much larger for the variant. The empirical data presented in the next section,
however, show that such worst-case behaviour does not seem to occur for the
practical grammars at hand.

Based on the number of items that are stored in the respective tables, we 
can investigate the number of steps that are performed by the two algorithms. 
We count the number of elementary parsing steps consisting in the derivation 
of one item in a table from one or more objects, such as productions, input 
symbols, or other items in a table. For example, in the case of Algorithm 2 every
combination of four objects of the form $[B\beta] \in U_k$, $(B \to \gamma) \in P$, $[\gamma] \in T_{k,j}$, and
$[\beta] \in T_{j,m}$ is counted as one elementary parsing step according to Step 6. For
a certain CFG and input, let us denote the number of applications of Steps 1,
2 and 3 of the Earley algorithm by $\mathcal{E}_1$, $\mathcal{E}_2$ and $\mathcal{E}_3$. Similarly, we introduce the
notation $\mathcal{V}_1, \ldots, \mathcal{V}_6$ for the six steps of the variant. We further define
$\mathcal{E} = \mathcal{E}_1 + \mathcal{E}_2 + \mathcal{E}_3 + |\{\alpha \mid (S \to \alpha) \in P\}|$, and
$\mathcal{V} = \mathcal{V}_1 + \mathcal{V}_2 + \cdots + \mathcal{V}_6 + |\{\alpha \mid (S \to \alpha) \in P\}|$.

Based on condition A1 in Proposition 3 we may conclude that $\mathcal{V}_1 \leq \mathcal{E}_1$,
$\mathcal{V}_2 \leq \mathcal{E}_2$ and $\mathcal{V}_3 \leq \mathcal{E}_3$. The number of applications of Step 4 is bounded by the
number of items $[\varepsilon] \in U_j$, which is bounded by the number of items $[A \to \gamma \bullet] \in E_{i,j}$.
This in turn is bounded by the number of items $[A \to \bullet \gamma] \in E_{i,i}$ times the
number of $j$ such that $\gamma \Rightarrow^* a_{i+1} \cdots a_j$. The number of such $j$ is bounded by $n + 1$,
and the number of $[A \to \bullet \gamma] \in E_{i,i}$ is bounded by $\mathcal{E}_1$ plus $|\{\alpha \mid (S \to \alpha) \in P\}|$.
Therefore we have $\mathcal{V}_4 \leq (n + 1) \cdot (\mathcal{E}_1 + |\{\alpha \mid (S \to \alpha) \in P\}|)$.

Steps 5 and 6 cannot be applied more than once for each application of
Steps 2 and 3 and $[\beta] \in T_{j,m}$, for at most $n + 1$ different values of $m$. Therefore
we have $\mathcal{V}_5 \leq (n + 1) \cdot \mathcal{V}_2 \leq (n + 1) \cdot \mathcal{E}_2$ and $\mathcal{V}_6 \leq (n + 1) \cdot \mathcal{V}_3 \leq (n + 1) \cdot \mathcal{E}_3$.

Combining the above, we obtain: 

Proposition 5. For fixed CFG and input of length $n$, we have $\mathcal{V} \leq (n + 2) \cdot \mathcal{E}$.
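The combination of the individual bounds can be spelled out as follows, writing $c$ as shorthand (introduced here only for brevity) for $|\{\alpha \mid (S \to \alpha) \in P\}|$:

```latex
\begin{align*}
\mathcal{V}
  &= \mathcal{V}_1 + \mathcal{V}_2 + \mathcal{V}_3 + \mathcal{V}_4 + \mathcal{V}_5 + \mathcal{V}_6 + c \\
  &\le (\mathcal{E}_1 + \mathcal{E}_2 + \mathcal{E}_3)
     + (n+1)(\mathcal{E}_1 + c)
     + (n+1)\,\mathcal{E}_2
     + (n+1)\,\mathcal{E}_3
     + c \\
  &= (n+2)\,(\mathcal{E}_1 + \mathcal{E}_2 + \mathcal{E}_3 + c) \\
  &= (n+2) \cdot \mathcal{E} .
\end{align*}
```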

In the worst case, the number of steps for the variant may thus be greater than
the number of steps for the original Earley algorithm by a factor which is $O(n)$.
Again, the data presented in the next section suggest that this consideration
has little bearing on practical cases.



5 Empirical Results 



We have performed some experiments with Algorithms 1 and 2 for four practical
context-free grammars.

The first grammar generates a subset of the programming language
ALGOL 68 psfl . The second and third grammars generate fragments of Dutch, 
and are referred to as the CORRie grammar p^] and the Deltra grammar p3|] , re- 
spectively. These grammars were stripped of their arguments in order to convert 
them into context-free grammars. The fourth grammar, referred to as the Alvey 
grammar [|] , generates a fragment of English and was automatically generated 
from a unification-based grammar. 

The test sentences have been obtained by automatic generation from the 
grammars, using a random generator to select productions, as explained in JljJ; 
therefore these sentences do not necessarily represent input typical of the ap- 
plications for which the grammars were written. Table 1 summarizes the test 
material. 



G = (V_T, V_N, P, S)    |G|    |V_N|    |P|     |w|     Parses
ALGOL 68                 783     167     330    13.7    2.6 * 10^0
CORRie                  1141     203     424    12.3    2.3 * 10^14
Deltra                  1929     281     703    10.8    1.1 * 10^73
Alvey                   5072     265    1484    10.7    3.2 * 10^4

Table 1. The test material: the four grammars and some of their dimensions, the
average length of the test sentences (20 sentences of various lengths for each grammar),
and the average number of parses per sentence (excluding parses containing cycles, i.e.
subderivations of the form $A \Rightarrow^+ A$).





            Earley                Variant                                         $\tau_2$ + Earley
G           $\mathcal{E}$   $|E|$     $\mathcal{V}$   $|U|$    $|T|$    $|U|+|T|$     $\mathcal{E}$   $|E|$
ALGOL 68     2,062    1,437      2,054    1,302     119      1,421         2,107    1,483
CORRie      19,164    8,361     15,492    3,498   2,746      6,244        17,450    8,751
Deltra      60,849   12,694     34,238    4,759   4,071      8,830        57,582   15,114
Alvey       47,562    6,304     27,786    5,398     180      5,578        47,552    6,314



Table 2. Dynamic requirements: average time and space per sentence. 

Our implementation is merely a prototype, which means that the absolute
duration of the parsing process says little about the actual efficiency of more
sophisticated implementations. Therefore, our measurements have been restricted
to implementation-independent quantities, viz. the number of elements stored in 
the parse table and the number of elementary steps performed by the algorithm. 
In a practical implementation, such quantities will strongly influence the space
and time complexity, although they do not represent the only determining
factors. Furthermore, all optimizations of the time and space efficiency have been
left out of consideration.

In our experiments we have also considered an alternative way of introducing
suffix items $[\beta]$ (albeit only those with $|\beta| \geq 2$) into the parsing process,
namely by first applying a grammar transformation $\tau_2$, and then executing
Algorithm 1 as usual. This was motivated by the literature on covers HID, which
shows that some complicated parsing algorithms can be simulated by means of
grammar transformations and simpler parsing algorithms. We have not found
any way to completely simulate Algorithm 2 in this manner, but the following
transformation captures some of its behaviour.⁴ For an arbitrary grammar
$G = (V_T, V_N, P, S)$, we define $\tau_2(G) = (V_T, V_N \cup I_V, P', S)$, where $P'$ contains
the following productions:

$A \to X[\alpha]$ for all $(A \to X\alpha) \in P$ with $|\alpha| > 1$;
$A \to \alpha$ for all $(A \to \alpha) \in P$ with $|\alpha| \leq 2$;
$[X\alpha] \to X[\alpha]$ for all $[X\alpha] \in I_V$ with $|\alpha| > 1$;
$[XY] \to XY$ for all $[XY] \in I_V$.

Note that the transformed grammar is in two normal form, which means that
the length of right-hand sides of productions is at most 2.
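The four production schemas can be sketched in Python as follows. The representation is an assumption made for illustration: a production is a pair `(lhs, rhs)` with `rhs` a tuple of symbols, and a suffix nonterminal $[\beta]$ is represented by the Python tuple `beta` itself, so a right-hand-side symbol that is a tuple stands for such a nonterminal. The names `tau2` and `DEMO` are hypothetical.

```python
def tau2(productions):
    """Return the production set P' of the transformed grammar tau_2(G)."""
    new = set()
    for A, rhs in productions:
        rhs = tuple(rhs)
        if len(rhs) <= 2:
            new.add((A, rhs))                        # A -> alpha, |alpha| <= 2
        else:
            new.add((A, (rhs[0], rhs[1:])))          # A -> X [alpha], |alpha| > 1
        # Every suffix of length >= 2 of a right-hand side is a suffix item.
        for begin in range(len(rhs) - 1):
            suffix = rhs[begin:]
            if len(suffix) > 2:
                new.add((suffix, (suffix[0], suffix[1:])))   # [X alpha] -> X [alpha]
            else:
                new.add((suffix, suffix))                    # [X Y] -> X Y
    return new


# Hypothetical grammar: S -> a B c d, B -> b
DEMO = [("S", ("a", "B", "c", "d")), ("B", ("b",))]
for lhs, rhs in sorted(tau2(DEMO), key=repr):
    print(lhs, "->", rhs)
```

Following the definition literally, the sketch also emits productions for suffix nonterminals that cannot be reached from the original nonterminals (for instance the one for a complete right-hand side); a practical version might prune these.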

Table 2 presents the costs of parsing the test sentences. These data show
that there is a significant gain in space and time efficiency in moving from
Algorithm 1 to Algorithm 2. The biggest improvement in the number of parsing
steps is observed in the case of the Alvey grammar, where it amounts to a
decrease by over 41%. The biggest improvement in the total number of items
stored in the tables occurs for the Deltra grammar, where it amounts to a decrease
by over 30%. Only for individual sentences for ALGOL 68 was there an increase
in time and space, by at most 1.2% and 0.2%, respectively.
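The two percentages quoted above can be recomputed from the Table 2 entries. The short sketch below does only that arithmetic; the helper name `reduction` and the variable names are hypothetical, and the figures are copied from the Alvey and Deltra rows of the table.

```python
def reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Alvey: parsing steps for Algorithm 1 versus Algorithm 2.
alvey_steps = reduction(47562, 27786)

# Deltra: stored items |E| for Algorithm 1 versus |U| + |T| for Algorithm 2.
deltra_items = reduction(12694, 8830)

print(f"Alvey, steps:  {alvey_steps:.1f}%")
print(f"Deltra, items: {deltra_items:.1f}%")
```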

In the case of ALGOL 68 and Alvey, it is striking that $T$ is so much smaller
than $U$ and $E$. This may be explained by the relatively low level of ambiguity, as
compared to the other two grammars (see Table 1). Both the Earley algorithm
and its variant predict many productions in the form of items in $U$ and $E$, but
only a limited number of these productions will be recognized in their entirety,
resulting in items in $T$. Although less striking in these cases, we see that
for CORRie and Deltra $T$ is also smaller than $U$. This suggests that the potential
undesirable behaviour of the variant with regard to the original Earley algorithm,
as discussed in the previous section, does not occur in practice.

The approach using the grammar transformation is not competitive with
the other two approaches. Although the number of steps is sometimes slightly
smaller than in the case of Algorithm 1, the space requirements are larger in all
cases.



⁴ Algorithm 2 avoids any use of items of the form $[A \to X \bullet Y]$. The same cannot be
achieved by means of a grammar transformation and Algorithm 1. An alternative would
be to apply some other kind of tabular algorithm to the transformed grammar. See e.g.



6 Concluding Remarks 



We have presented a variant of the Earley algorithm and have discussed cases in
which it achieves space and time savings with respect to the original algorithm.
Our variant is based on the following two main ideas. First, we do not compute
left positions of productions until we are guaranteed that production recognition
can be completed within the input. Second, we only use suffix items as defined
in (2).

The idea of dropping left positions of productions has first been proposed 
by pHI ) where a functional realization of left-corner parsing is presented. This 
idea was rediscovered by || and expressed in a more direct way, using a table 
similar to our table U. 

The idea of using suffix items has also been proposed in p3| . It has later been 
rediscovered by ||. It was also applied to LR parsing in ]20[ . In the literature 
on chart parsing, e.g. in g], one sometimes also finds a weaker form of this idea, 
where the set of items used in labeling edges is $I_q = \{[A \to \beta] \mid (A \to \alpha\beta) \in P\}$.
One observes that, with respect to items $[A \to \alpha \bullet \beta]$ from $I_E$, the $\alpha$ is omitted
as in the case of $I_V$, yet the left-hand side $A$ is retained. If this idea is not
combined with the idea of dropping left positions, then the benefit of this is
limited to grammars containing many pairs of productions of the form $A \to \alpha\beta$
and $A \to \gamma\beta$, with $\alpha \neq \gamma$. The idea of using suffix items is related to the difference
between two kinds of Earley parsing for the ID/LP formalism: in [25] the items
are of the form $[A \to \alpha \bullet \beta]$, where $\alpha$ is a string of constituents and $\beta$ is a set
of constituents, whereas in Q, both $\alpha$ and $\beta$ are sets. This allows representation
of several items according to [25] by a single item according to Q, as has been
argued in [17, Section 9.2].

The ideas above rely on productions or items having some suffix in common. 
Alternatively, one can investigate optimizations that rely on productions that 
have prefixes in common pq| . 